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Abstract 

Presence-only data are referred to situations in which, given a censoring mecha- 
nism, a binary response can be observed only with respect to on outcome, usually 
called presence. In this work we present a Bayesian approach to the problem of 
presence-only data based on a two levels scheme. A probability law and a case- 
control design are combined to handle the double source of uncertainty: one due 
to the censoring and one due to the sampling. We propose a new formalization for 
the logistic model with presence-only data that allows further insight into inferential 
issues related to the model. We concentrate on the case of the linear logistic re- 
gression and, in order to make inference on the parameters of interest, we present a 
Markov Chain Monte Carlo algorithm with data augmentation that does not require 
the a priori knowledge of the population prevalence. A simulation study concerning 
24,000 simulated datasets related to different scenarios is presented comparing our 
proposal to optimal benchmarks. 

Keywords: Bayesian modeling, case-control design, data augmentation, logistic regres- 
sion, Markov Chain Monte Carlo, population prevalence, presence-only data, simulation. 

1 Introduction 

There is a significant body of literature in statistics, econometrics and ecology dealing with 
the modeling of discrete responses under biased or preferential sampling designs. They 
are particularly popular in the natural sciences when species distributions are studied. 
Such sample schemes may reduce the survey cost especially when one of the responses is 



rare. A large part of statistical literature concerns the case-control design, retrospective, 
choice-based or response-based sampling (Lancaster and Imbens, 1996). In the simplest 



case a sample of cases and a sample of controls are available and for each observation a 
set of "attributes/covariates" is observed in both samples. Then inference is carried out 
following standard statistical procedures (Armenian, 2009). 

A case that has received increasing attention in the literature is the situation where the 
sample of controls is a random sample from the whole population with information only 
on the attributes and not on the response (Lancaster and Imbens, 1996). This situation 



is fairly common in ecological studies where only species' presence is recorded when field 
surveys are carried out. In the ecological literature, since the 1990's such data are called 
presence only data (see Araujo and Williams, 2000, and references therein). |Pearce and 



Boyce (2006) define presence-only data as "consisting only of observations of the organism 
but with no reliable data where the species was not found". Atlases, museum and herbar- 
ium records, species lists, incidental observation databases and radio-tracking studies are 
examples of such data. 

In recent years we find a considerably growing literature describing approaches to the 



modeling of this type of data, among the many ecological papers we recall Keating and 



Cherrj 


r (2004), 


Pearce and Boyce (2006 


), 


Elith et al. (2006), ] 


51ith and Leathwick 


(2009), 


Franklin (2010 


and, most notably, in the statistical literature 


Ward et al. ( 


2009), 


Warton 


and Shepherd 


(2010), Chakraborty et al. 


(: 


2011), 


Di Lorenzo et al. 


(2011 


) and 


Dorazio 


(2012) 


. While i 


n Warton and Shepherd 


(20K 


)) and 


Chakraborty et al. 


(2011 


) to model the 



presence-only data Poisson point processes are considered in the likelihood and Bayesian 
framework respectively, in Ward et al. (2009) and Di Lorenzo et al. (2011) a modified case- 
control logistic model is adopted in the likelihood and Bayesian perspective respectively, 
in both papers there is no account for possible dependence structure in the observations. 
In Dorazio (2012) the asymptotic relations between the two approaches are discussed. 



A different approach, MaxEnt, is based on the maximum entropy principle ( Jaynes, 1957). 
In MaxEnt (Phillips et al., 2006, Elith et al., 2011) the relative entropy between the dis- 
tribution of covariates at locations where the species is present and the unconditional 
background distribution of covariates is maximized subject to some constrains concerning 
empirical statistics (see Philips et al., 2006, for details). As pointed by Dorazio (2012) "the 
MaxEnt method requires knowledge of species' prevalence for its estimator of occurrence 
to be consistent". 

In what follows we are going to use the name presence-only data when referring to the 
above sketched general problem of having information on the presence and covariates 
jointly on a sample from a population, while information on only the covariates is avail- 
able on any sample from the same population. This work is developed in the same discrete 
setting as in Ward et al. (2009) and Di Lorenzo et al. (2011), i.e., we have a population 
of independent units, no dependence structure, such as spatial correlation, is anticipated. 
We defer the treatment of this extension to a subsequent work. 

The main contribution of the paper is a new rigorous formalization of the logistic regres- 
sion model with presence-only data that allows further insight into the inferential issues. 
This leads us to an algorithmic procedure that, among other results, returns a MCMC ap- 
proximation of the response prevalence under general knowledge of the process generating 



2 



the data. We also present a large simulation study involving 24,000 simulated datasets 
and comparing our approach to other two models representing optimal benchmarks. 
The paper is organized as follows. Section [2] introduces a general framework for the 
presence-only data problems, Section [3] presents our Bayesian approach, Section [4] de- 
scribes the MCMC algorithm while results related to the simulation study are reported 
in Section [5} Finally in Section [6] some conclusions are drawn and future developments 
briefly described. 



2 Linear logistic regression for presence-only data. 



The analysis of a binary response related to a set of explicative covariates is usually 
carried out through the use of the logistic regression where the logit of the conditional 
probability of occurrence is modeled as a function of covariates. In this section, we first 
introduce a general framework for the modeling of presence-only data and then consider 
the case of the linear logistic regression. The approach proposed is built on two levels 



and we partially follow the formulation introduced by [Ward et al. (2009) but adopting a 
Bayesian scheme as in Divino et al. (2011). 



2.1 A two level approach. 

Let Y be a binary variable informing on the presence (Y = 1) or absence (V = 0) of 
a population's attribute and let X = (X\, ...,X)~) denote a set of highly informative, 
on the same attribute, covariates which are available on the same population. Then, the 
presence-only problem can be formalized by considering a censorship mechanism that acts 
when observing the response Y, so that part of the population units are not reachable. 
In particular, we refer to the situation in which we are able to detect only a partial set of 
units on which the attribute of interest is present while the information on the covariates 
X is available on the entire population. In this situation, we have to consider two types of 
uncertainty: the uncertainty due to the mechanism of censorship and the uncertainty due 
to the sampling procedure. Moreover, since we are not able to collect a random sample 
of observable data, we need to adjust for the sampling mechanism through the use of a 



case-control scheme (Breslow and Dey 



, 1980 


Breslow 


2005 


Armenian 


2009 



In order to build a Bayesian model, in this framework we adopt the following conceptual 
scheme in two levels. 



Level 1. Given the population of interest U of size N, the binary responses y = 
(yi, ...jUn) arc generated independently by a probability law M.. 



Level 2. Let IA V be the subset of U where we observe Y — 1. A modified case-control 
design is applied so that a sample of presences, considered as cases, is selected from U p 
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and a sample of "contaminated" controls (Lancaster and Imbens, 1996) is selected from 
the whole population U, with all the covariates but no information on Y. 



Here, we cannot approach the model construction using only a finite population ap- 



proach (Sarndal, 1978) because of the censoring mechanism that "masks" distributional 
information on Y already at the population level. By the introduction of Level 1 we 
can describe the censored observations as random quantities generated by the model Ai. 
Hence, the problem of presence-only data can be formalized as a problem of missing data 



(Rubin, 1976, Little and Rubin, 1987). 



2.2 The model generating population data. 

At the first level, we assume that the law M. is defined in terms of the conditional 
probability of occurrence Pr(Y = l\x), denoted by tt*(x), when the covariates are X = 
x. Moreover, we consider that the relation between Y and X is formalized through a 
regression function <f>(x) on the logit scale 



(x) = logit TT*(x), 



that is 



TT (X) 



(1) 



(2) 



1 + e^ x ) ' 

When the data y = (y\, y^) are independently generated from A4, we denote by tt the 
empirical prevalence of the binary response Y in U, expressed as the ratio of the number 
of presences N% to the size of the population, that is 



TT 



N 



2.3 The modified case-control design. 



At the second level, we adopt a case-control design modified for presence-only data (Lan- 



caster and Imbens, 1996) in order to account for the specific sampling procedure consid- 



ered. The use of the case-control scheme is necessary at all times when it is appropriate 
to select observations in fixed proportions with respect to the values of the response vari- 
able. This can occur when the attribute of interest represents a phenomenon that is rare 
among the units of the population as for example a rare disease or a rare exposure in 



epidemiological studies (Woodward, 2005). 



Now, let C be a binary indicator of inclusion into the sample (C = 1 denotes that a unit 
is in the sample), let p — Pr(C — 1\Y — 0) and p\ = Pr(C — 1\Y — 1) be the inclusion 
probability of the absences and the presences, respectively. Under the assumption that, 
given Y, the sampling mechanism is independent from the covariates X, the conditional 
probability of occurrence is modified through the Bayes rule as 



Pr{Y = l\C = l,x) 



0(x) 



Pie 
Po + Pie' 



(*)■ 



(3) 
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Hence, the corresponding case-control regression function <f) cc (x) denned as the logit of 
([3]) is given by 

cc (x) = 0(x) + log^. (4) 
Po 

In particular, if the selection of cases (ni) and controls (n ) is made independently without 
replacement, the inclusion probabilities are given in terms of the empirical prevalence tt 
by 



n 

Po = T- 



and 



- n)N 

m 



so that the equation (|4]) becomes 



7lN 



4>cc{x) = <p{x) + log— -log— — . (5) 
n 1 — 7r 

In our framework, since the response variable Y is already censored at the population 
level, the standard case-control design cannot be adopted but it should be modified in 
such a way that a sample of presences is matched with an independent sample drawn 



from the entire population, named the background sample (Zaniewski et ai, 2002, Ward 



et ai, 2009). Remark that in this sample the response variable is unobserved and only 
the covariates are available. 

In this way, the complete sample S is composed by a set S u of n u independent background 
data, where the response Y is not observed, drawn from the entire U and by a set S p 
of n p independent observations selected from the sub-population of presences U p . This 
procedure implies that the reference population U is augmented with its subset IA V so that 
the total number of observations considered in the sampling scheme becomes N + N x . 
To illustrate the sampling framework we are going to adopt here, let us consider the 
following situation: we can label population units of type y = 1 only when they are 
isolated from units of type y = 0. This can be formalized by introducing a binary stratum 
variable Z such that Z = indicates when an observation is drawn from the entire 
population U while Z = 1 denotes the sampling from the sub-population U p . Remark 
that Z = 1 implies Y = 1 whilst Z = implies that Y is an unknown value y G {0, 1}. 
Moreover, by construction Z is independent from the covariates X, given the response 
Y. The introduction of the variable Z allows us to define the structure of the data 
at the population level and at the sample level in terms of presences/absences (Y) and 
known/unknown data (Z), as reported in Table [T] and Table [2j 



Y/Z 


Z = 


Z = 1 


Total 


Y = 


iV 





N 


Y = 1 


Ni 




2Nx 


Total 


N 




N + N x 



Table 1: Data structure at the population level. 
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Y/Z Z = Z = 1 Total 



K = n « n 
K = 1 nxu n p rix 

Total 



Table 2: Data structure at the sample level. 



In Table [TJ Nq is the number of absences in the population U while in Table [2j uq u 
and n\ u respectively denote the unknown frequencies of absences and presences in the 
sub-sample S u . Remark that, in the above described situation, the inclusion probability 
of units with or without the mentioned attribute changes. In fact, while an absence can 
be drawn only when sampling from U, a presence can be selected when sampling both 
from U and from U p . Thus, one has 

The introduction of the stratum variable Z allows us also to exactly derive the logis- 
tic regression model under the case-control design modified for presence-only data. In 
fact, when we consider the population U augmented with its subset U p , the model 7r*(x) 
represents the conditional probability to mark a presence only when Z — 0, that is 
Pr(Y = 1\Z = 0,x) = 7t*(x). On the other hand, when Z — 1, we simply have 
Pr(Y — 1\Z — l,x) — 1. We can prove the following result. 



and 



Proposition 1. Under the assumption that Z is independent from X given Y, one has 

Proof. From the hypothesis of conditional independence it results 

Pr(Z\Y,x) = Pr(Z\Y), 

that can be express also as 

Pr(Y\Z,x)Pr(Z\x) _ Pr(Y\Z)Pr(Z) 
Pr(Y\x) ~ Pr(Y) ' 

Let consider the case with Y = 1 and Z = 0, one has 

Pr(Y = 1\Z = 0, x)Pr(Z = 0\x) _ Pr(Y = 1\Z = 0)Pr(Z = 0) 
PriY = l\x) ~ PriY = 1) ' 

6 



The probabilities enclosed in the second term can be derived from Table [T] and one has 



w*(x)Pr(Z = Q\x) 
Pr(Y = l\x) 



Ni N 
N jV+jVi 
2ATi 



(9) 



N+Ni 



In the case Y — 1 and Z — 1 one similarly obtains 



Pr{Z = l\x) 
Pr(Y = l\x) 



jVj N+Ni 
2JVi 
AT+iVi 



(10) 



From (10) it results Pr(Y = l\x) = 2Pr{Z = l\x) and by substituting in ([9]), one can 



1 + ir*(x) 
Pr(Y = l\x 



and hence PriZ = l\x) 



TT*(x) 



1 + 71* (x) 



Now, it is 



derive that Pr(Z = 0\x) 
simple to obtain that 

□ 

If we assume that, given Y, the inclusion into the sample (C = 1) is independent from 
the covariates X, one hasQ 



2tt*(x) 
1 + n*(x) 



Pr(Y = 0|C = 1, x) Pr(C = l\x) = \ ■ ^J X | p„ 



and 



Pr(Y = 1\C = l,x) Pr(C = l\x) 



1 + ir*(x) 
2tt*(x) 

1 + 7T*(x 



Pi- 



(11) 



(12) 



Then, from the ratio of (12) to (11), it results 



Pr(Y = 1|C= l,x) 2n*(x) p x 
Pr(Y = 0\C = l,x) ~ 1 - n* (x) po' 

and by plugging the quantities po and pi, as defined in ^ and in ([7]), into the logit of 
Pr(Y = 1\C = l,x), one obtains the following relation 



logit Pr(Y = 1\C = l,x) 



log 
log 
log 



2ti* (x) pi 

1 - 71* (x) po 

2n*(x) n lu + n p 1 — n 
l-ir*(x) n 0u 2tt 

7i* (x) riiu + n p 1 — 7r 
l-n(x) n 0u 



71 



logit 71* (x) + log 



(x) + lot 



niu + n p 

n lu + n p 
log 

n 0u 



log 



7T 



7T 



1 - 7T 



1 -7T' 



(13) 



see Appendix for the detailed proof. 
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that represents the logistic regression model under the case-control design for presence- 
only data. As well, we can now formalize the presence-only data regression function 
4> P od(x) as 

4> P od{x) = 00) + log — — - log — — . (14) 

n 0u 1 - 7r 

Although the derivation is substantially different, we end with the same formulation as 
in 



Ward et al. (2009). Now, in order to make parameter estimation possible, we need to 

n lu + n p 1 



handle the ratio 



81 
Po 



TT 



(15) 



n 0u 2tt 

where the quantities tt and n\ u are unknown {n$ u = n u — 

In the recen t literature, two main approaches have been proposed. The first one by |Ward 



et al. 



(2009) replace the ratio 



with the ratio of the expected numbers of presences 



and absences in the sample, that is 

E[n lu + n p ] 1 



Pi 
Po 



71 



ixn u + fi„ 1 



71 



7rn u + n v 



i?[nou] 27r (1 — ir)n u 2ir 2im u 

These authors adopt a likelihood approach and computation is carried out via the EM al- 
gorithm. As they underline, this approximation can be easily implemented if the empirical 
population prevalence tt is known a priori. They discuss also the possibility to estimate 
7r jointly with the regression function when the prevalence is identifiable, as for example 
in the linear logistic regression, and with respect to this case they present a simulation 
example. The difficulty in obtaining efficient joint estimates because of the correlation 
between tt and the intercept of the linear regression term is discussed. Notice that |Ward 



et al. (2009) considers a slightly different representation of the ratio (16), omitting the 
multiplier "2" in the denominator. 



Di Lorenzo et al. (2011 ), dealing with a problem of abundance data, use the approximation 



(16), but they adopt a Bayesian approach and consider the population prevalence tt as a 



further parameter in the model. They choose an informative Beta prior for tt, but their 
MCMC algorithm contains an unusual weakness since the simulation of tt is performed 
from its prior and not from the posterior that can be derived through the interaction 
between the parameter tt and the regression function (p(x). 



A different approximation of the ratio (15) can be obtained by considering the sample 
prevalence in S u (the background sample) 



where 



tt a 



niu 



niu 



n,. 



ieSu 



Due to the censorship process, this quantity is unknown but it would be the maximum 
likelihood estimator for tt if the data y u = {yi, i G S u } could be observed. Now, replacing 



tt by tt u in (15) one obtains 

Pi n lu + n p 1 
Po n 0u 



TT,, 



n lu + n p n u - n lu n lu + n p 



2tt u 



n„ 



n lu 2n lu 



2n 



la 



(17) 
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that allows to formulate a computable version of the regression function for presence-only 
data as 

pod (x)^0(x) + log^^. (18) 

This function depends on the data y u in S u which are not directly observable, but if 
y u is treated as missing data one can enclose it into the estimation process and then 
obtain a consistent approximation for 4> po d{%)- in particular, in a Bayesian framework, 
this idea can be performed by using a Markov Chain Monte Carlo computation with 
data augmentation. Moreover from the use of MCMC simulations we can also obtain an 
approximation of ir u and therefore an estimate of the empirical population prevalence tt. 
Details are given in Section 4. 



The approximation (17) can, in principle, be always adopted, but some care must be 



used as identifiability issues are present. We follow the recommendation in Ward et al 



(2009) to approach jointly estimates of <p{x) and tt only when the latter is identifiable 



with respect to the regression function, as for example in the linear regression case (see 



Ward et al. , 2009 , for mathematical details) 



2.4 The linear logistic regression. 

If we consider a linear regression function (j>{x) = x(3, where = (/?i, .. .,/?&) is the vector 
of the regression parameters, a computable model for presence-only data can be defined 
through the following approximation 

<fipod(x) ~ x(3 + log Ulu + Up , (19) 

or equivalently through the approximation of the conditional probability of occurrence at 
the sample level 



Pr{Y = 1\C = l,x) 



exp{x/3 + log^g^} 
l + exp^ + log^f" 1 } 

l+(l + ^)ex P {z/3}' 



(20) 



In this particular case, all the unknowns of the model are the linear parameters vector 
and the missing data y u in the background sample S u . 



3 The hierarchical Bayesian model. 

Due to the censorship process affecting the data, we can acquire complete information 
only on the stratum variable Z and not on the binary response Y. Then, it seems natural 
to model Z as the observable variable. If we consider the conditional joint distribution of 
Z and Y 

Pr(Z, Y\C=l,x) = Pr(Z\Y, C = l,x)Pr(Y\C = l,x), (21) 
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through the marginalization over Y, the probability Pr(Z\C = 1, x) can be obtained and 
we can express the relation between presences and covariates in terms of regression of Z 
respect to X. Notice that, while Pr(Y\C = l,x) can be obtained from (20), the term 
Pr(Z\Y, C = l,x), due to the conditional independence between Z and X given Y, simply 
reduce to be equal to Pr(Z\Y, C = 1) that can be derived from Table [2] 
We point out that, even if the response Y does not play an explicit role after the marginal- 
ization step, we need to keep it in the model as a hidden variable in order to obtain the 
approximation for the quantity n\ u = Yli^s Vii necessary to correct the linear regression 
function for presence-only data. 

Now, we can formalize the hierarchical Bayesian model to estimate the parameters of a 
linear logistic regression under the case-control scheme adjusted for presence-only data. 
In order to better explain the conditional relationship underlying the hierarchy, we intro- 
duce the graph in Figure [T] The dashed node indicates a variable hidden with respect to 
the conditional relationships. 




Figure 1: Graphical representation of the hierarchical Bayesian model. 



The priors. At the top of the hierarchy, we assume the hyper parameter 9 distributed 
as p(9). At the second level, we consider the prior probability distribution on (5 depending 
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on the hyper parameter 6, that is (3\8 ~ p(/3\6). At the third level, the unobserved data 
y u in S u are considered latent parameters with prior distribution Bernoulli (denoted by 



Be) with probability of occurrence given by the approximation in (20), that is 



2/i|C< = l,Xi,p ~ 5e ( r | , i G S u . 

This point is important for deriving the predictive distribution of the unobserved data y u 
necessary in the estimation algorithm. 

The likelihood. At the lowest level of the hierarchy, we have the likelihood, defined 

with respect to the observable stratum variable Z. Recalling that from the Table [2] we 

n 



have Pr(Z = 1\Y = 0, C = 1) = and Pr(Z = 1\Y = 1,C = 1) = p - — , when ([211 

n\ u + n p 

is marginalized over Y, one obtains the approximation 



Pr(Z = l\C = l,x,P) 



n 



p 



and hence 



exp{x/3 + log^g^} 

n lu + n p 1 + exp{x/3 + log ^±^} 
g| exp{a;/3} 



Pr(Z = 0|C = l,x,/3) = l-Pr(Z = l|C = l,x,/3) 

1 + expj^/S} 



(22) 



(23) 



1+ ( 1 + S) ex P^}' 
Thus, we can assume that for alH G 5 the conditional distribution of is Bernoulli with 



probability of occurence given by ( 22 ) , that is 



Z i \C i = l,x i ,P~Be\ ^— ? \ ,ieS. 

Recalling that Z\ = for all i G S u while Zi = l for all i G S'p, the likelihood function can 
be written as 

i(ftz , x) = n i±^gj x n . 

<" 1 + (l + St) exp{n/3} 1 + (l + St) npflD) 



Ward e£ a/. (2009) defines this function as the observed likelihood versus the full likelihood 
that, instead, considers the distribution of the stratum variable Z jointly with the response 
Y. 
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The posterior. Now, through the Bayes rule we derive the full posterior 

pOMKx) oc p(9)p(f3\9)m^) (24) 
that can be used to make inference on the quantities of interest. 



4 The MCMC computation. 



Samples from (24) can be obtained via Markov Chain Monte Carlo simulation (Robert 



and Casella, 


2004; 


Liu 


2008) 



pier for the vector (3 and the hyper parameter 9, we need to sample also the latent y u . For 



this reason we introduce a step of data augmentation (Tanner and Wong ? 1987, Tanner 



1996) in the estimation procedure. The basic idea of the data augmentation technique 



is to augment the set of observed data to a set of completed data that follow a simpler 



distribution (Liu and Wu, 1999). In our framework, we need to augment the observations 
of the stratum variable z with the missing values y u in order to have, at each itera- 
tion, a consistent value of the quantity n\ u , necessary to adjust the regression function 
4>pod(x) ~ x/3 + log ni ^~ np for presence-only data. The following result allows for an easy 



implementation of the data augmentation step. 



Proposition 2. Using the approximation (17) of the ratio (15), the posterior predictive 



probability of occurrence for an unobserved response y in the sub-sample S u is approxi- 
mated by the model A4 that generates the data at the population level, that is 



Pr{Y = 1\Z = 0,C = l,x) ^n*(x). 



(25) 



Proof. From the conditional independence between Z and X given Y, the predictive 
probability of occurrence in S u is given by 



Pr(Y = l\Z = 0,C = l,x) 



Pr(Z = 0\Y = 1,C = l)Pr(Y = l\C = l,x) 



Pr(Z = 0\C = l,x) 
From Table [2l we have that Pr(Z = 0\Y = 1, C = 1) = and hence 

n lu Pr{Y = l\C= l,x) 



PriY = 1\Z = 0,C = l,x) 



n p + niu Pr{Z = 0\C = 1, x) 
Now, recalling that in the general case one has 

(l + exp{0(x)} 
Pr {Y = l\C =l,x) V 1 



(26) 



l+(l + ^) exp{0(x)} 



(27) 
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and 



Pr(Z = Q\C = l,x) 



1 + exp{(j)(x)} 



l+(l + ^) exp{0(*» 



by substituting (27) and (28) in (26), one obtains 



Pr(Y = 1\Z = 0,C= l,x) 



n lu (l + ^)-P{^)} 



n p + n lu 1 + exp{(j)(x)} 
1 + exp{<j)(x)} 

7T*{X). 



(28) 



□ 



4.1 The data augmentation algorithm. 

A general MCMC scheme to perform inference on a linear regression model for presence- 
only data can be defined as follow. 



Step 0. Initialize 6, (3 and y u 
Step 1. Set n lu = J2i & s u V% 
Step 2. Sample 9 from p(0|z, x, /3) 
Step 3. Sample (3 from p(/3\z, x, 6) 

Step 4. Sample yi from p(yi\Zi = 0, Cj = 1, Xj, /9) for all i G 5, 
Goto Step 1 



After the initialization of all the arrays (Step 0), Step 1 sets a current value for the quantity 
n\ u to adjust the regression function (f> po d{x)- Step 2 and Step 3 consider the sampling from 
the posterior of the hyper parameter 9 and the regression parameter /3, respectively, and 
they can be performed by Metropolis-Hasting schemes (Robert and Casella, 2004). Step 
4 concerns the data augmentation for the unobserved y u in order to update consistently 
the quantity n\ u at the following iteration. From the result (25), this simulation can be 
obtained by a Gibbs sampler (Robert and Casella, 2004) since the posterior predictive 
distribution for all % £ S u is approximated by Bernoulli with parameter of occurrence 



TC(Xi 



l+exp{xi/3} ' 
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4.2 The estimation of the prevalence tt. 



From the data augmentation algorithm we can obtain a MCMC estimate of the population 
prevalence tt. In fact, if at each iteration t, after the Markov chain has reached the equi- 
librium, we save the current value n®, we can obtain a consistent MCMC approximation 
of the sample prevalence tt u in S u by 

TT ~ — (2V) 

''mcmc V / 

n u 

where fi\ u is the ergodic mean of the augmentations over the Markov chain, that is 

V T n {t) 



T 



Therefore, since ir u would be a consistent estimator for tt, TT mcmc represents also a consis- 
tent estimation of the empirical population prevalence. 



5 A comparative simulation study. 



We present a simulation experiment to evaluate the performances of the model (20). To 
this aim we generate several datasets in the way described below and we compare our 
proposal with respect to two models acting in two different situations: (a) the censorship 
process does not act on the population U so that the data y are completely observed; 
(b) the censorship is present, but we assume known the population prevalence so that 



approximation (16) can be used. In (a) we are able to estimate a linear logistic model 
(denoted by Mo), no correction is required and (j>o{x) = x(3. In (b) we consider a linear 
logistic model for presence-only data, denoted by M 1; with regression function (j)\{x) = 



x/3 + nn ™ n np . Model (20) (denoted by M 2 ) is estimated when the censorship process acts 
on the data and no information is available on the population prevalence. In this case, 
the regression function is given by <j>i(x) = x/3 + Ul ™ +np . Remark that model M 2 can be 
estimated when the least amount of information is available, Mi requires less information 
than M but more than M 2 and M can be used only in the ideal situation of complete 
information. We assume Mi as benchmark model in the case of presence-only data. 



The generation of data. In order to set the simulation study, we need to generate 
the covariates X and the binary response Y. In particular, we consider two covariates: 
X\, giving strong information on the distribution of the response Y, and X 2 , representing 
a term of noise, not available in the estimation step. We assume X\ distributed as a 
mixture of two Gaussian densities (denoted by N), centred in \i a = 4.0 and fib = —4.0 
respectively, and with equal variances a 2 = 4.0, that is 

Xi ~ wN a (fi a ; a 2 ) + (1 - w)N b (fi b ; a 2 ). 

The weight w is a realization of a Bernoulli random variable with probability of occurrence 
fixed to p — 0.165. X 2 has standard Gaussian distribution N(0, 1). Finally, the binary 
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response Y, given the covariates X\ and X 2 , is Bernoulli distributed with probability of 
occurrence 

_ exp{0 Q + Pixi + 2 x 2 } 
71 ~ 1 + exp{/3 + x x x + 2 x 2 } ' 

We generate covariates and binary response with respect to a population U of size N = 
10000. Three general scenarios with different level of complexity have been considered: 

(i) flo — 0, 0\ = 1, 2 = : only the informative covariate X\ generates the 

(ii) f3o = 0, 0\ = 1, 02 = 1 : a term of noise X 2 is added to the informative 
covariate; 

(iii) 0o — 1, Pi — 1, 02 — 1 : Xi, X 2 and a constant effect generate the data. 

The case-control sampling. For each scenario, we sample under the case-control de- 
sign with a ratio of presence/unobserved equal to 1 : 4 and with respect to eight different 
sample sizes 

n = 50, 100, 200, 500, 1000, 1500, 2000, 3000. 

For example, if the sample size is equal to n = 500, we build the corresponding simulated 
experiment by extracting a random sample S p from U p of n p = 100 presences and a 
random sample S u from U of n u = 400 unobserved values, covariates are available for 
the whole sample S. We consider k = 1000 independent replications of each experiment. 
In summary, we generate a database of 24,000 datasets (8 sample sizes, 3 scenarios and 
1000 replications). With respect to the generating of the data, we considered a quite 
general framework since the contribution of an informative covariate was combined with 
a constant effect and a white Gaussian noise. With respect to the three scenarios, we 
obtained empirical population prevalences respectively = 0.215, tt^) = 0.223 and 
-K(iH) = 0.286. 

The MCMC estimation. The estimation is performed in a Bayesian framework for all 
the models M , Mi and M 2 . The likelihood function we use in the estimation is based on 
a model that does not always replicate the model used to generate data. More precisely 
for all experiments (i), (ii) and (iii) the estimation model is: 

logit(Pr(F = 1\X 1 = xi)) = 0o + 0ix ± (30) 

than with scenario (i) the model that generates the data and the one defining the likelihood 
are the same, whilst for scenarios (ii) and (iii) the likelihood model becomes increasingly 
different from the one that generates the data. Notice that we consider a simpler structure 
than the one shown in Figure Q as we choose a Gaussian prior iV(0, 25) for all regression 
parameters (0 O , 0i) and no hyper parameter is considered. Then, MCMC estimates are 
computed using 5000 runs after 10000 iterations of burn-in, no thinning is applied as 
samples autocorrelation is negligible. 
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Results. In what follows we report Figures and Tables built on scenario (iii) as it rep- 
resents the most complex of the three alternatives and it is our "worst" case. In each 
replicate of an experiment, point estimates are computed as posterior means over 5000 
iterations. In Figure [2] boxplots describing point estimates behaviour are reported, hor- 
izontal lines corresponding to the "true" values are drawn. The first box corresponds to 
procedure M , the second to Mi and the third to our proposal M 2 . In M the prevalence 
7r is estimated as the ratio of the observed presences in S u to the sample size n u . In Mi, 
although 7T is assumed known a priori, we consider its posterior prediction in S u . Finally 
in M2, the prevalence is obtained at each MCMC step as described in section <12 and then 
the mean over 5000 runs is taken. In Table [3] further details of the point estimates are 
reported: the median and in parenthesis the first and third quartiles. From the Figures 
and the values we can see that the three procedure lead to "comparable" values with the 
obvious reduction of variability when n increases. Remark that the estimates for M 2 , 
although affected by a larger variability with small sample sizes, rapidly approaches Mq 
and Mi behaviour with increasing sample size. This can be seen more clearly in Figure [3] 
where rooted mean square errors (rmse) are reported. As far as Pi is concerned the lack 
of knowledge on X 2 leads to biased point estimates regardless the estimation procedure. 
Tables [5] and [6] in Appendix report point estimates for scenarios (i) and (ii). For scenarios 
(i) unbiased estimates are obtained while (ii) is affected by the same distortion as (iii) 
but with smaller variability. 
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n Model /3 Pi ir 



M 1.42 (0.68 ; 2.33) 1.15 (0.88 
50 Mi 3.13 (1.78 ; 4.46) 1.69 (1.17 
M 2 1.79 (-3.38 ; 4.26) 1.44 (0.76 


1.55) 0.28 (0.25 
2.28) 0.31 (0.26 
2.12) 0.24 (0.13 


0.35) 
0.35) 
0.34) 


M 1.14 (0.72 
100 Mi 2.12 (1.11 
M 2 1.92 (0.16 


1.62) 1.00 (0.86 
3.51) 1.30 (0.97 
3.87) 1.24 (0.89 


1.22) 0.29 (0.25 
1.80) 0.30 (0.26 
1.78) 0.28 (0.21 


0.33) 
0.34) 
0.36) 


M 1.01 (0.72 
200 Mi 1.53 (0.89 
M 2 1.49 (0.59 


1.36) 0.94 (0.83 
2.39) 1.08 (0.87 
2.62) 1.07 (0.83 


1.06) 0.29 (0.26 
1.37) 0.29 (0.27 
1.37) 0.29 (0.24 


0.31) 
0.32) 
0.34) 


M 0.94 (0.75 
500 Mi 1.12 (0.78 
M 2 1.17 (0.62 


1.15) 0.89 (0.82 
1.57) 0.94 (0.82 
1.82) 0.94 (0.80 


0.96) 0.29 (0.27 
1.10) 0.29 (0.28 
1.12) 0.29 (0.26 


0.30) 
0.31) 
0.32) 


M 0.91 (0.78 
1000 Mi 1.03 (0.79 
M 2 1.05 (0.68 


1.04) 0.88 (0.83 
1.34) 0.91 (0.83 
1.49) 0.91 (0.82 


0.92) 0.28 (0.28 
1.01) 0.29 (0.28 
1.03) 0.29 (0.27 


0.30) 
0.30) 
0.31) 


M 0.89 (0.80 
1500 Mi 1.00 (0.78 
M 2 1.01 (0.71 


1.00) 0.86 (0.83 
1.24) 0.89 (0.82 
1.35) 0.90 (0.82 


0.91) 0.29 (0.28 
0.98) 0.29 (0.28 
0.99) 0.29 (0.27 


0.29) 
0.30) 
0.31) 


M 0.89 (0.82 
2000 Mi 0.96 (0.79 
M 2 0.96 (0.71 


0.98) 0.87 (0.84 
1.15) 0.89 (0.83 
1.23) 0.88 (0.82 


0.90) 0.29 (0.28 
0.95) 0.29 (0.28 
0.96) 0.29 (0.27 


0.29) 
0.29) 
0.30) 


M 0.90 (0.83 
3000 Mi 0.94 (0.82 
M 2 0.95 (0.76 


0.97) 0.87 (0.84 
1.09) 0.88 (0.84 
1.17) 0.88 (0.83 


0.89) 0.29 (0.28 
0.93) 0.29 (0.28 
0.94) 0.29 (0.28 


0.29) 
0.29) 
0.30) 



Table 3: Scenario (iii): point estimates of regression parameters and prevalence computed 
as medians over 1000 replicates with increasing sample sizes and different models (M ,Mi 
and M 2 ). In parenthesis distributions quartiles are reported. 
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Figure 2: Scenario (iii): boxplots of simulations with increasing sample sizes and different 
models(M ,M! and M 2 ). 
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Figure 3: Scenario (iii): root mean squared errors for different models (Mo, Mi and M2) 
over the 1000 replications, plots with increasing sample sizes for (3q (a), fii (b) and 7r (c). 
Dashed trajectories are reported to show the patterns. 
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From Ward et al. (2009) we know that pairwise correlation between parameters is 
present. In Table [4] we report the empirical pairwise correlation measures, obtained as 
the averages with respect to the 1000 samples, with increasing sample sizes across the 
different models. No significant differences in the pattern of correlation fli) between 
the models Mi and M 2 while the correlation (/3i,7r) has a general weaker pattern in M\ 
than M 2 . With respect to the correlation (/3 , 7r) more significant difference are present 
between M\ and M 2 . In Figure [4] scatterplots of (3 versus tt in the 1000 replicates are 
plotted with equal axis across estimation procedures and sample sizes. These pictures help 
us to understand how this correlation evolves with increasing sample sizes. M 2 produces 
the most positive correlated point estimates this being an advantage whenever the model 
is properly specified. 



Model 




M 






Mi 






M 2 




n 


Pa; A 


Po;n 


f3i,TT 


Pol Pi 


Po',n 


Pi;* 


Pa; Pi 


Po;* 


Pi;* 


50 


0.65 


0.26 


-0.09 


0.59 


0.10 


-0.30 


0.68 


0.81 


0.31 


100 


0.75 


0.24 


-0.12 


0.89 


0.29 


0.02 


0.82 


0.76 


0.37 


200 


0.78 


0.34 


-0.04 


0.94 


0.39 


0.18 


0.90 


0.78 


0.48 


500 


0.79 


0.38 


0.00 


0.95 


0.41 


0.24 


0.92 


0.77 


0.51 


1000 


0.77 


0.38 


-0.01 


0.94 


0.46 


0.27 


0.91 


0.81 


0.54 


1500 


0.78 


0.42 


0.00 


0.95 


0.48 


0.28 


0.92 


0.81 


0.55 


2000 


0.77 


0.35 


-0.06 


0.94 


0.49 


0.30 


0.92 


0.81 


0.55 


3000 


0.81 


0.37 


-0.01 


0.95 


0.43 


0.23 


0.91 


0.80 


0.52 



Table 4: Scenario (iii): pairwise parameters correlation (average over the 1000 replicates) 
with increasing sample sizes and different models (Mo, Mi and M 2 ). 
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Figure 4: Scenario (iii): scatterplot of n versus (3 with increasing sample sizes and 
different models (M ,M 1 and M 2 ). 



To verify the predictive performance we considered relative measures of specificity and 
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build as the ratio of the same measures for M 2 (numerator) and 
for Mi (denominator) respectively. In Figure [5] the obtained values are reported versus 
sample sizes. Remark that M<i rapidly reaches the same level of performance as M\ with 
increasing sample size. 
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Figure 5: Scenario (iii): relative specificity and sensitivity computed as ratios between 
M2 and Mi specificity and sensitivity measures with increasing sample sizes. Dashed 
trajectories are reported to show the patterns. 



6 Conclusions 

In this work, we presented a Bayesian procedure to estimate the parameters of logistic 
regressions for presence-only data. The approach we proposed is based on a two levels 
scheme where a generating probability law is combined with a case-control design ad- 
justed for presence-only data. The new formalization allows to consider rigorously all the 



mathematical details of the model as for instance the approximation of the ratio (15) that 
represents the crucial point when modeling presence-only data in the finite population 
setting. We want to point out that our formalization is substantially different from the 



work by Ward et al. (2009), although we end with the same statistical model. We con- 
centrated on the case of the linear logistic regression because we were aware that some 
care is necessary to handle the identifiability issues present in the model. 
The comparative simulation study considered three scenarios with different levels of com- 
plexity across increasing sample sizes. We presented detailed results with respect to the 
most difficult case where the contribution of an informative covariate was mixed with a 
constant effect and a white Gaussian noise. In term of point estimation, the estimates 
based on our model were comparable to those obtained under the presence-only data 
benchmark in which the empirical population prevalence was assumed to be known. On 
the other hand, this lack of information on the population prevalence affected the effi- 
ciency of the estimates, that resulted smaller for our model than for the benchmark. This 
difference was significant only when the sample size n was smaller than 1000, i.e. when 
the number of observed presences n p was smaller than 200. From the predictive point 
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of view, our model performed as well as the benchmark already for sample sizes about 
n = 200, i.e. for a number of observed presences at least n p = 40. Also the pairwise 
correlation between /3q and tt, that represents an important issue as pointed by |Ward 



et al. (2009), became negligible with increasing sample sizes. 



From the computational point of view, the procedure were carried out through a MCMC 
scheme with data augmentation and implemented in Fortran codes. 

Future work will investigate the possibility of adding dependence structures among the 
population units into the model as, for instance, through the use of regression functions 
with structured random effects. 
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Appendix 



Proposition 3. Under the assumption that, given Y, the inclusion into the sample 
(C = 1) is independent from the covariates X, it results 

Pr (Y = 0\C = 1, x) Pr(C = l\x) = I ~ ^ Po 

and 

2tt*(t) 

Pr(Y = 1\C = l,x) Pr(C = l\x) = Pi- 



1 + 7T*(x) 



Proof. In general we have that 



^y\c^) ^ c ; r f:\^ (3D 



From the conditional independence between C = 1 and X given Y, the (31) becomes 

Pr(C = 1\Y) Pr(Y\x) 



Pr(Y\C = l,x) 



Pr(C = l\x) 



Recalling that PriY — l\x) — i+Jf^ and the definitions of po = Pr(C — 1\Y — 0) and 
px = Pr(C = 1\Y = 1) the proofs for Y = and Y = 1 can be derive by simple algebra. 



24 



n Model f3 fa ir 



M 0.40 (-0.31 ; 1.45) 1.56 (1.08 
50 Mi 2.19 (0.68 ; 3.57) 2.19 (1.37 
M 2 1.03 (-2.51 ; 3.35) 2.00 (1.07 


2.72) 0.20 (0.18 
3.74) 0.23 (0.18 
3.44) 0.19 (0.12 


0.25) 
0.27) 
0.26) 


M 0.31 (-0.18 ; 0.88) 1.23 (0.99 
100 Mi 1.24 (0.29 ; 2.36) 1.55 (1.12 
M 2 1.22 (-0.40 ; 2.69) 1.50 (1.07 


1.61) 0.21 (0.19 
2.22) 0.23 (0.19 
2.22) 0.16 (0.16 


0.25) 
0.26) 
0.27) 


M 0.11 (-0.20 
200 Mi 0.46 (0.00 ; 
M 2 0.48 (-0.24 


0.46) 1.08 (0.95 
1.27) 1.23 (0.99 
1.55) 1.23 (0.96 


1.28) 0.22 (0.19 
1.56) 0.22 (0.20 
1.59) 0.22 (0.19 


0.24) 
0.24) 
0.25) 


M 0.06 (-0.10 
500 Mi 0.17 (-0.09 
M 2 0.14 (-0.26 


0.25) 1.02 (0.94 
0.47) 1.04 (0.92 
0.61) 1.03 (0.89 


1.12) 0.22 (0.20 

1.19) 0.22 (0.20 

1.20) 0.22 (0.19 


0.23) 
0.23) 
0.24) 


M 0.04 (-0.05 
1000 Mi 0.04 (-0.13 
M 2 0.03 (-0.22 


0.17) 1.01 (0.95 
0.26) 0.99 (0.91 
0.34) 0.98 (0.90 


1.07) 0.22 (0.21 

1.08) 0.21 (0.21 

1.09) 0.21 (0.20 


0.22) 
0.22) 
0.23) 


M 0.05 (-0.04 
1500 Mi 0.01 (-0.12 
M 2 0.00 (-0.24 


0.15) 0.99 (0.95 
0.18) 0.97 (0.91 
0.23) 0.97 (0.90 


1.04) 0.21 (0.21 

1.05) 0.21 (0.21 
1.05) 0.21 (0.20 


0.22) 
0.22) 
0.22) 


M 0.03 (-0.04 
2000 Mi 0.00 (-0.12 
M 2 -0.02 (-0.22 


0.10) 0.99 (0.95 
0.14) 0.97 (0.92 
; 0.14) 0.96 (0.90 


1.03) 0.21 (0.21 
1.03) 0.21 (0.21 
1.02) 0.21 (0.20 


0.22) 
0.22) 
0.22) 


M 0.03 (-0.02 ; 0.10) 0.98 (0.96 
3000 Mi 0.00 (-0.10 ; 0.09) 0.96 (0.92 
M 2 -0.03 (-0.18 ; 0.11) 0.95 (0.91 


1.02) 0.21 (0.21 
1.00) 0.21 (0.21 
1.00) 0.21 (0.20 


0.22) 
0.22) 
0.22) 



Table 5: Scenario (i): point estimates of regression parameters and prevalence computed 
as medians over 1000 replicates with increasing sample sizes and different models (M ,Mi 
and M 2 ). In parenthesis distributions quartiles are reported. 
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n Model f3 fa it 



M 0.42 (-0.33 ; 1.42) 1.34 (0.94 
50 Mi 2.12 (0.78 ; 3.39) 1.95 (1.25 
M 2 1.26 (-2.99 ; 3.33) 1.75 (0.95 


2.13) 0.23 (0.18 
2.96) 0.24 (0.20 
2.79) 0.20 (0.13 


0.28) 
0.28) 
0.28) 


M 0.19 (-0.20 ; 0.73) 1.07 (0.88 
100 Mi 1.13 (0.36 ; 2.40) 1.39 (1.00 
M 2 1.03 (-0.40 ; 2.65) 1.34 (0.96 


1.35) 0.23 (0.19 
1.96) 0.23 (0.21 
1.96) 0.22 (0.17 


0.25) 
0.26) 
0.28) 


M 0.13 (-0.18 
200 Mi 0.48 (0.02 ; 
M 2 0.48 (-0.38 


0.45) 0.97 (0.84 
1.17) 1.08 (0.88 
1.56) 1.07 (0.83 


1.12) 0.23 (0.20 
1.36) 0.23 (0.21 
1.41) 0.23 (0.18 


0.25) 
0.25) 
0.27) 


M 0.09 (-0.07 
500 Mi 0.22 (-0.04 
M 2 0.23 (-0.24 


0.27) 0.92 (0.85 
0.53) 0.95 (0.84 
0.69) 0.95 (0.81 


1.00) 0.22 (0.21 
1.09) 0.23 (0.21 
1.09) 0.22 (0.20 


0.24) 
0.24) 
0.25) 


M 0.08 (-0.02 
1000 Mi 0.09 (-0.07 
M 2 0.08 (-0.19 


0.20) 0.90 (0.86 
0.31) 0.90 (0.83 
0.38) 0.89 (0.82 


0.95) 0.22 (0.21 
0.99) 0.22 (0.21 
1.00) 0.22 (0.21 


0.23) 
0.23) 
0.24) 


M 0.08 (0.00 ; 0.18) 0.90 (0.86 
1500 Mi 0.08 (-0.05 ; 0.23) 0.89 (0.84 
M 2 0.05 (-0.17 ; 0.30) 0.89 (0.82 


0.94) 0.22 (0.22 
0.96) 0.22 (0.22 
0.96) 0.22 (0.21 


0.23) 
0.23) 
0.23) 


M 0.07 (0.00 ; 0.15) 0.89 (0.86 
2000 Mi 0.06 (-0.06 ; 0.20) 0.89 (0.84 
M 2 0.02 (-0.17 ; 0.24) 0.88 (0.82 


0.92) 0.22 (0.22 
0.95) 0.22 (0.22 
0.94) 0.22 (0.21 


0.23) 
0.23) 
0.23) 


M 0.07 (0.01 ; 0.13) 0.89 (0.87 
3000 Mi 0.04 (-0.04 ; 0.15) 0.88 (0.84 
M 2 0.02 (-0.13 ; 0.17) 0.87 (0.83 


0.91) 0.22 (0.22 
0.92) 0.22 (0.22 
0.92) 0.22 (0.21 


0.23) 
0.23) 
0.23) 



Table 6: Scenario (ii): point estimates of regression parameters and prevalence computed 
as medians over 1000 replicates with increasing sample sizes and different models (M ,Mi 
and M 2 ). In parenthesis distributions quartiles are reported. 
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